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SUMMARY 


In this study a parallel program to analyze transient 
finite element problems was written and implemented on a 
system of transputer processors. The program uses the 
explicit time integration algorithm which eliminates the need 
for equation solving making it more suitable for parallel 
computations. An interprocessor communication scheme was 
developed for arbitrary two-dimensional grid processor 
configurations. Several 3-D problems were analyzed on a 
system with a small number of processors. 
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number of processors 
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prep 
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tot 


time step 
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which direction a transputer 


u a a ^ t f « /*^ a ^ 
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direction a transputer 


information in which 
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communication time, 
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the time required to 
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1 . 0 INTRODUCTION 


Today computers are widely used in the engineering field 
for the analysis of complex problems and to aid in the design 
of new components. Despite the impressive speed of the 
current generation of computers, there are many problems such 
as those involving three-dimensional analysis 
disciplinary optimization for which even the speed of today's 
supercomputers is not sufficient. Also, the solution time for 
many large scale problems needs to be greatly reduced before 
they can be effectively incorporated into the engineering 
design process. In an attempt to achieve a major increase in 
speed of computers, attention has focused on the development 

of parallel processing computers. 

With parallel computers several processing units are 

connected together with the idea of subdividing a given 
problem into separate tasks that can be performed 
independently on the different processors. Theoretically, 
this approach gives a decrease in computation time over a 
traditional sequential computer which performs all the tasks 


in a sequential fashion. 

Here the application of parallel computations to the 
analysis of transient finite element problems will be 
investigated. Transient finite element problems are among the 
most computationally intensive because the time history of 
interest must be divided into small steps and the solution to 
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the problem must be computed progressively at each successive 
step in time. These types of problems arise in the modelling 
of automobile crashworthiness, nuclear accidents and fluid- 
structure interaction in liquid storage tanks and generally 
involve large three-dimensional meshes and nonlinear 
deformations and material behavior. 

The purpose of this study was to implement a three- 
dimensional transient finite element program on a system of 
transputer processors. A two-dimensional grid transputer 
processor configuration was chosen as the most appropriate for 
the finite element problems of interest in this study. In 
conjunction with the finite element program an interprocessor 
communication algorithm was developed that can accommodate any 
number of processors in an arbitrary grid configuration and 
can adapt the distance allowable communication to suit 
different problems. Some simple test problems of various 
sizes were evaluated on a four processor transputer system to 
study the effects of communication time on the efficiency of 
the parallel computation. 


2 


2 . 0 GOVERNING EQUATION 


The governing finite element equations 
dynamics problems can be written in the form 

E a + I ■ F 


for structural 
(2-1) 


where 


£ - external force vector 
f - internal force vector 
M - mass matrix 

a - nodal acceleration vector 
For linear systems the internal force vector can be 
by the following formula 

_£ = K d 


expressed 


(2-2) 


here 


K — structure stiffness matrix 
d - nodal displacement vector 
and consequently equation (2-1) can be rewritten as 

£! a + K £ = I 


(2-3) 


The initial conditions for the above equation are given by 

d° = d (t=0) ( 2-4a) 

y° = v(t=o) ( 2-4b) 

where y is the vector of nodal velocity. Thus, the initial 
value problem consists of finding i(t) satisfying equations 


(2-3) and (2-4) for t>0. 

Often it is necessary to analyze free vibrations of a 
structure in which case the vector of nodal displacement at 
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t=0 is assumed known and £=0 and y°=0 . 

The derivation of these governing equations can be found 
in many books on the finite element method* 1, 


"parenthetical references placed superior to the line of the text refer to the bibliography. 
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3.0 NUMERICAL INTEGRATION ALGORITHM 

In this section the procedures used to solve the initial 
value problem will be presented. The most general method of 
solution is referred to as direct integration and involves 
dividing the time period of interest into steps and 
progressively computing the solution at each step in time. 

Perhaps the most popular direct integration method is the 

Newmark-Beta roethod <3, ° and is given by 

= Y n + At l(l-Y) 2 ? + Y a 0 * 1 ] (3-1) 

= £ n + A t Y r ‘ * (At) 2 l(-| - p) a? + P a”* 1 ] (3-2) 

where A t is the time step and gamma and beta are parameters 
that affect the stability and accuracy of the method and have 
the range 0< 0 <1/2, 0< y <1. The superscript notation is 

used to indicate the time, for example d" stands for d(nAt) . 
The most widely used variations of the Newmark-Beta formula 
corresponding to different combinations of y and s are 
given in the Table 1. ; The .various types of Newmark 

integration can be classified into two general categories: 
implicit and explicit depending on whether it is necessary to 
solve a system of linear equations to compute the updated 
values of the solution. 

The first two methods in the table, the trapezoidal 
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method and the central difference method, are the most 
frequently used. The trapezoidal rule has been shown to be 
unconditionally stable in that convergence can be achieved 
with any time step, however, the accuracy of the method can be 
poor if too large a time step is used. The disadvantage of 
this method is that it is implicit and a system of equations 
needs to be solved in order to compute the next value of 
displacement, d^ 1 . Consequently, this method is somewhat more 
difficult to implement in a parallel computation. A flow 
chart for this method is given in Table 2. 

With y - 1/2 and 6=0 the integration formula is called 
the central difference method and is an explicit method. This 
is because there is no need for any equation solving, provided 
that the mass matrix is lumped (diagonal) . This method is 
particularly well suited for parallel computing because the 
displacements and velocities can be updated on different 
processors and only the displacements, used to compute the 
internal forces, need to be exchange after each time step. The 
disadvantage of this method is that it is only conditionally 
stable and the error grows exponentially in time and are 
meaningless if a certain critical time step is exceeded (5) . 

The restriction on the time step is given by 

AC * — (3-3) 
where u rJlx is the maximum frequency computed from the 
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Table 1. 

Numerical integration algorithms 







generalized eigenvalue problem 


XX - o > 2 MX 


(3-4) 


However, it is more convenient to use more conservative 
condition 


At £ 


2 

^max 


(3-5) 


where is the maximum value of all the element 

frequencies. In this case the frequencies can be computed 
from the much smaller element eigenvalue problem 

K*X=u 2 e M e X (3-6) 

and for many elements simple closed form solutions can be 
found for this problem. The flow chart for the central 
difference method is given in Table 3. 
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Table 2. 


Flow chart for the trapezoidal method. 



Given initial conditions: d , v for t 0 


1] Compute K ,13 and 13 1 • 

2] Compute acceleration vector a 

a 0 = K' 1 (F° - K d°) 

3 ] LOOP n=0 FOR number time steps 

a] compute K 

K = M + B A t 2 K 

b] compute f"* 1 

^n+i = 5 At 2 Z"* 1 + i 3 [d n +At v n + Ac (-5“B)a ] 

c] compute K" 1 

d] compute d m1 
d m1 = K* 1 

e] compute a 

a ml = m ' 1 (r* 1 - £ ar') 

f] compute v"* 1 

v"* 1 = v" + A c [ ( 1- Y ) + Y fi"* 1 3 

g] if n - number of time step then terminate, 
otherwise n=n+l and GO TO a] . 
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Table 3. 


Flow chart for the central difference method 


Given initial conditions: d°, v° for t=0. 


1] Compute mass matrix H and its inverse Jf' 1 , and stiffness 
matrix X 

2] Compute acceleration 

a 0 - XUZ° - X <a°) 

3] LOOP n=0 FOR number of time steps 

a] compute external force vector £ n 

b] update displacements 

» d" + A C v n + (Ac 2 /2) a" 


c] compute acceleration 
a^ 1 = M* 1 (F^ 1 - X d"* 1 ) 


d] update velocities 

v"* 1 = v n + ( At/2) (a n + a"* 1 ) 


e] if n = number of time steps then terminate, 
otherwise n = n+1 and GO TO a] . 
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4 . 0 PARALLEL COMPUTATIONS 


The development of finite element programs for parallel 
computers depends strongly on the type of machine that is to 
be used. Parallel processing computers are usually classified 
according to number of processors in the system or by the type 
of memory that is accessible to the processors. In the first 
case parallel computers are usually termed fine grained if 
there are many processors in the system, which may be as many 
as 64,000, or coarse grained if the system is composed of 
relatively few, approximately 5 to 20 , large processors. 

Also parallel computers can be differentiated by the 
architecture of the system memory. With shared memory 
architectures all processors have access to a common global 
memory while with distributed memory systems each processor 
has its own local memory and information exchange takes place 
through interprocessor communication. Because of 

difficulties that arise when different processors try to 
access the same memory at the same time, it is usually the 
case that only coarse grained systems have shared memory and 
local or distributed memory is used for fine grained 

architectures . 

on a shared memory system interprocessor communication is 
not necessary since once a processor writes data into a 
location in the global memory all other processors have access 
to this data. The advantage of this type of design is that 
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computer programming is simplified, however, accessing the 
common global memory can take longer, because contention 
problems can occur when several processors try to access the 
shared memory simultaneously. 

On the other hand with distributed memory systems, memory 
contention problems do not exist, because each processor has 
its own local memory to store data. However, for problems 
where data must be shared between processors an interprocessor 
communication protocol must be developed by the programmer and 
excessive delays in communication can significantly degrade 
the performance of the system. It is because of these memory 
contention and communication problems that parallel processors 
usually do not approach their theoretical problem solving 
speed. 

The purpose of this study was to develop a transient 
finite element program for parallel computation on a 
transputer system of processors. The transputer is a chip 
level processor with local memory and four communication links 
that can be used to connect transputers in a variety of 
configurations. Two different models of transputers are 
currently available. The INMOS T414 transputer has 2 MBytes 
of local memory and floating point performance of 0.1 MFLOPS 
while the INMOS T800 transputer has a floating point 
performance of 1.5 MFLOPS. The INMOS T414 transputer was used 
for the problems in this study. The four transputer links can 
simultaneously transmit data at a rate of 10 MBits per second. 
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The transputers are proved using the OCCAM language- 
which was specifically developed to facilitate parallel 
programming and inter-processor communication. 

one transputer, Known as the root transputer, can 
communicate with the host computer in this case an IBM PC. It 
is also used to edit and compile programs and in addition ft 
distributes programs among the other transputers in the system 

which are referred to as the network. 

As stated earlier, each transputer has four communication 

channels which allows it to be linKed with other transputers. 
This flexibility allows a programmer the choice of different 
networ* configurations for instance: a torus, a hypercube, 2-D 
mesh, a pipeline and a binary tree topology. Some of these 
are sketched in Figure 1. Some of the configurations may be 

better than others for certain classes of problems. For the 

^ _ two-dimensional grid 

problems of interest here, a 

configuration was thought to be the most appropriate. 

For this study, an explicit, finite-element program was 
written to analyze two and three dimensional transient 
problems. An explicit-integration algorithm was chosen since 
no equation solving is necessary and different nodes or nodal 
groups can be updated independently. To partition the problem 
for parallel computation, the nodes of the finite-element mesh 
are divided into groups and assigned to different proc 
This partitioning, however, should be done in a way such that 
the amount of interprocessor communication is minimized. 
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The. initial task of the root processor is to define the 
problem data, define the information needed for interprocessor 
communication, and transmit this data to the network 
processors. The task of each of the network processors is to 
update the accelerations, velocities and displacements of the 
nodes in its group over a time step. It should be noted that 
although the updates of the nodal groups are uncoupled, the 
displacements of the nodes in other subdomains at the 
preceding time must be known in order to compute the internal 
forces. Therefore, after the new displacements in a subdomam 
are calculated, they must be communicated to other processors 
before the next update can proceed. To solve a problem most 
efficiently and achieve the greatest speedup over a 
sequential computer, the time used for interprocessor 
communication should be minimized. A flow chart for this 

program is given in Figure 2. 

The problem parameters which are specified on the root 
processor are the nodal coordinates, element connectivity, the 
initial conditions y° and fl", the time step, the number of time 
steps and the data defining the material properties. Because 
the program was written to be run on a 2-D grid processor 
configuration, the processor connectivity is defined by 
giving the number of rows and columns of processors. For 
example, the grid in Fig. lb has 3 rows and 4 columns. Data 
specifying which nodes are assigned to which processors and 
the limit of interprocessor communication, the maximum 


ROOT PROCESSOR 


SYSTEM 


PROCESSOR 



b 


Figure 2 . 


Flow chart of the parallel finite 
element program 
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Figure 2 . Cont . 
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distance of neighbors that a processor can communicate with, 
is also defined. The limit describes the maximum number of 
steps which is required to perform the interprocessor 
communication. In each step only the transputers which are 
connected can exchange data; for example, in Figure lb, 
transputers number 5 and 6 can exchange information in one 
step but transputers number 4 & 6 in two steps. 

The first task of the root processor is to calculate the 
various matrices needed to describe the interprocessor 
communication. These matrices are determined by the structure 
of the finite element mesh, the grouping of the nodes and the 
size and shape of the processor grid. The matrix containing 
information on the number of processors with which each 
transputer has to exchange information is (num. neigh. send, 
num. neigh . rec) and the grid locations of these processors are 
in the matrices ( next . neigh . next . neigh . rec ) . Two other 
matrices ( signal . in and signal . out^ contain information on 
when and from which direction each transputer must send and 
receive data. The root processor also calculates how many 
communication steps are needed for each processor to transmit 
data to its most remote neighbor (Figure 3) . This gives an 
estimate how well the mesh has been partitioned. For 
instance, if one of the transputers needs many more steps than 
the others to reach its most remote neighbor, the partitioning 
of the mesh should be reconsidered. 

As the root transputer transmits problem data, network 
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transputers receive this information. Keep the needed data and 
send the information further to other transputers. Depending 
on the grid position of a transputer, it can receive 
information from west or north direction, see Fig. 3, and send 
it to east or south directions. The way in which data is 
initially distributed among transputers is shown in Figure 4. 

Before a network transputer starts the time integration 
loop, it has to rearrange the problem data. Nodal 
displacements which must be exchanged are grouped together in 
increasing order according to the identification number of the 
transputer that receives data. Grouping displacements in this 
fashion ensures that nodes in the sending and in the receiving 

transputers are rearranged in the same order. 

After calculating mass matrices for each element the time 
stepping is performed. First, the internal force matrix is 
calculated then nodal accelerations, velocities and 
displacements, respectively are updated. After each time 
step, updated nodal displacements are exchanged between the 

network processors. 


19 














5 . 0 NUMERICAL EXAMPLES 


Several numerical examples have been analyzed to evaluate 
the efficiency of the parallel algorithm. Unfortunately, 
since a large system of transputers was unavailable, the 
previously discussed communication algorithm was implemented 
on four T414 transputers connected in pipeline (Figure la) 
were used. In these examples, the size and geometry of the 
finite element mesh was varied to study the effects of 
different amounts of interprocessor communication. Currently 
the program uses linear triangular elements (Figure 5a) in 
two-dimensions and 8 node hexahedrals (Figure 5b) in three- 
dimensions . 


5.1 Three-dimensional Bar Model 

The problem statement for this example is shown in Figure 
6. One end of the bar was kept fixed while an initial 
displacement was applied to the opposite end of the bar. 
Figure 7 gives a plot of the displacement at the end of the 
bar as a function of time that was computed using the central 
difference algorithm. 

To see how the amount of interprocessor communication 
affects the solution time, in this example, the number of 
nodes in the processor groups is varied while the number of 
nodal displacements that must be exchanged between processors 
is kept fixed. When the number of network transputers is 
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nnodoz 


a] using linear triangular elements 



b] using 8 node henafnedral elements 


Figure 5 . 


Finite element model of the bar problem 


2 



Figure 6, Problem statement for the three-dimensional 
bar 
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dlsp lm] x le-9 



Figure 7 . Displacements of the end of the bar as 
a function of time . 
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increased, a mesh of appropriate size is chosen to keep the 
number of processor elements and the number of nodes exchanged 
the same. The notation is used that nnodex, nnodey, nnodez 
indicate the number of nodes in x,y and z direction 
respectively and (nnodex) x (nnodey) x (nnodez) is the total 
number of nodes in the problem. The different cases that were 
considered for various number of processors are given in 
Tables A-l, A-2 and A-3 . The solution times for these cases 
are given in Tables A-4 through A-12. 

From an analysis of the measured solution times and of 
the internal structure of the program it was possible to 
obtain an approximate formula to calculate the execution time. 
The total solution time is assumed to be composed of three 
parts: computation time, communication time and preparation 
time in the form 


T = T + T + T 

■ l tot cp A cm prep 

T„ , - total time 

tot 

T cp - computation time 
T - communication time 

cm 

T p rep - preparation time * 


T 

prep 


N .l C o' + 'Wl + kc 2 + “ , « C 3 + N p,.. C 4 + 
Np.„Nnd C 5 + Np.,l<<V-"nP>/ 2 > C P + 


Va 


lues of constants Cq. 


.C 


11 


are given in Table 4. 
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*(K„d-V C 7 


(5-2) 

(5-3) 

(5-4) 


T =p = ( N p.el C 8 + N up C 9)* N step 

T cm = (C 10 + N 4X C^) *N step 

where 

k - is the integer of the ratio (N,^,./ length) , 

data between transputers is sent in vectors of a 
fixed size, the time of sending & receiving data 
is proportional to k 

length - the size of the vector sending data among 
transputers 

The constants in these formulas were obtained by using the 
least square fit of the data and are given in Table 4 . In the 
cases of computation and communication, times the measured 
values from Tables A-4 through A-12 were used. A different 
approach was taken for the calculation of the preparation 
time. To avoid solving large system of equations, the 
procedure was divided into six parts and approximate formulas 
were obtained for each part. In this case the computation was 
greatly simplified since for each part only two constants have 
to be calculated. The theoretical times obtain from above 
formulas are compared to actual values in Tables A-4 through 

A-12. 

Several conclusions can be drawn from analysis of the 
results. First, the communication time is very small compared 
to the computation time and virtually can be neglected. 
Because data has to be exchanged after each time step the 
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Table 4. 

Constants for calculation of the execution time 



Value[s] 

o 

o 

1.287x10' 4 

C, 

1.366x10* 5 

C 2 

9.224x10' 2 

<=3 

1.462x10' 2 


2.170x10‘ 3 

C 5 

1 ,749xl0 5 

C 6 

2.315x10 3 

C 7 

6.623x10’ 3 

o 

CD 

9.881x10‘ 2 

C 9 

8.735x10* 4 

C 10 

1.280x10* 4 

C 11 

4.640x10‘ 5 
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computation time in each step is determined by the processor 
with the largest work load. Consequently, it is very 
important to distribute the work load uniformly among network 

transputers . 

When assigning the work load to transputers it should be 
taken into account that number of processor elements rather 
than number of nodes to update determines the computation 
time. This conclusion can be drawn from comparison of th 
constants in equation (5-3). The constant corresponding to 
N el (c 8 ) is much larger that the one corresponding to N up (C 9 ) . 

To minimize the number of elements that must be stored by 
more than one processor, the assigning of the nodes to the 
processors should be done along the cross-section plane of the 

mesh with the smallest number of nodes. 

The preparation time can be significant and for a small 
mesh running only a small number of time steps can even 
surpass the advantage of shorter computation time. For a 
given mesh and network of transputers, there exists a minimum 
number of time steps for which running the program 
concurrently is more efficient than sequential computation. 

The above analysis 'shows that the total time depends 
primarily, particularly for larger number of time steps, on 
the number of elements assigned to each processor. The 
computation time is determined by the transputer with the 
largest number of elements, so it is important to assign an 
equal number of elements to all transputers. A simple 
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iterative algorithm was developed for nodal assignment that 
gives a relatively balanced processor work load. 

For example, in the simplest case of 2 transputers we 
have the following relations 


(Np.el)* 

= \l + 

C el 

(5-5a) 

(Np.el>. 

= B el + 

C el 

( 5-5b) 

Nel 

= (N p .el 

>A + ( N p.el ’ C el 

(5-5c) 


< n p..i>a 

(N p .et) B 


A 


el 


B 


el 


C 


el 


total number of elements 
assigned to transputer A 
total number of elements 
assigned to transputer B 
elements assigned only to 
transputer A 

elements assigned only to 
transputer B 

common elements assigned to 
both transputers 


The goal is to achieve (N pel ) A = (N pel )„ or as close as 
possible, what occurs when A el = B el . 

The algorithm assigning nodes in this way is presented in 
Figure 8. However, it should be noted that the results are 
influenced by the manner in which the numbering of elements 
was done. Figure 9 illustrates this for the case of 2-D grid. 
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5.1.1 Analysis of the Results 


Figure 10 shows the plot of the total solution time as a 
function of the number of processors used in solving the cube 
problem; the points were obtained based on the formula (5-1) . 
There are two reasons why the solution time does not decrease 
linearly as the number of processors increases. First, the 
data preparation time on the host computer depends on the size 
of the local processor matrices, which do not decrease 
proportianally to the number of processors, and also on the 
global problem parameters which are constant. So that the 
actual preparation time can increase with increasing numbers 
of processors. Second, the number of elements assigned to each 
transputer is not equal to the total number of elements 
divided by number of processors. It is increased by the 
number of elements which are shared by neighboring processors? 
this number has a constant value. This value depends on the 
shape of the structure being modelled. This dependence is 
illustrated by Figure 11; r represents the ratio A/A cube where 
A . - number of elements in the cross sectional 

area of the cube which has the given number of 
elements . 

— number of elements in the cross sectional 
area of a parallelepiped which has the same 
number elements. 

The three possible shapes corresponding to different values of 
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b] numbering along the longer side, 



Dependence of the maximum num.proc . elem on 
the order of element numbering 
















Figure 


11 . 


Dependence of the shape o£ a parallelepiped 
on the value of the coefficient r . 
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the coefficient r are sketched in Figure 11 and Figure 12 
represents the dependence of the total time on the shape of a 
parallelepiped. Theoretically, the larger the ratio the 
longer the total solution time. However, if r > 1, then the 
nodes have not been assigned to transputers along the shortest 
side. Consequently, if decomposition and numbering of the 
mesh has been done correctly the cube is the worst case. In 
Figure 12 the total time is increasing almost linearly. This 
results from the fact that, in this example, the computation 
time plays the dominating role (relatively large N step =100) . 
The computation time grows approximately linearly with N p el 
because the influence of is very small (see equation 5-3) . 
Since number of elements assigned to each transputer grows 
linearly with r, the total time grows approximately linearly. 

The results in Tables A-4 through A— 12 also indicate that 
the communication time (T cm ) is very small and can be 
neglected. Then, the solution time can be divided into the 
preparation time (T prep ) and the computation time (T cp ) . 
Preparation time is used to set up the problem data for 
parallel computations and would be absent in a sequential 
computation. Here, the transputer efficiency will be defined 
as the ratio of ( [T cp ] se< /p)/T tot . Since the transputer time is 
composed of computation time and data manipulation time, this 
value indicates what parts of the overall time is devoted to 
calculations & data manipulation. The results obtained from 
formula (5-1) are presented in Figure 13. In order to obtain 
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Total time {aj 



Figure 12 . 


Total time as a function of the shape o 
a parallelepiped’ , nun. of . elem=const 


r™A/A 




for furoner eicplanaoion see page 31 



equal values for local parameters; N p .el' N up' N nd' the shape of 
the cross sectional area was kept unchanged, (nnodex) x 
(nnodez) =5x6, while the value of nnodey was adjusted to 
give the required number of elements per processor. If is 
neglected, there are two additional times which occur in 
parallel computations and are absent in sequential. First is 
T pre p which does not depend on N #tep . Second is the part of T cp 
which requires duplicate computations over the elements 
belonging to the division border (these elements are assigned 
to two different transputers) ; this time is proportional to 
N . Three possible cases are illustrated by Figure 13. 

step c 

When T is much larger than the additional computation time, 

prep 

that occurs for small N step and large meshes (Figure 13a) , 
processor efficiency decreases with num.of.elem per processor 
since T prep grows faster than T cp . The second extreme takes 
place when T prep is small compared to additional computation 
time; this is true for large N step (Figure 13c). Because the 
num.of.elem on the division border is kept constant, the 
processor efficiency increases with num.of.elem per processor. 
The third case occurs when both additional times have 
comparable influence (Figure 13b) . 

5.1.2 Estimation of Optimal Number of System Processors 

Another problem of interest is to determine the number of 
transputers in the network that should be used to execute a 
given problem in the shortest time. It should be noted that 
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rocessor efficiency [%] 



Figure 13a. 


Processor efficiency as a function of 
num.of . elem per processor , 

num. time . step=l 


’ Here num.of . elem per processor is defined to be <N el /p) , which means that 
insures that the efficiency is measured w.r... the sequence 




processor efficiency [%] 



num.of.elem per processor 

Figure 13b. Processor efficiency as a function of 
num.of.elem per processor" r 
num .time . step=100 


In this case nor*. of . elen per processor is equal to (N^/p) , what means 
that elements assignee tc mere than one transputer are counted only once. Thus 
insures that the efficiency is measured w.r.t. the sequential camputaticn. 
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processor efficiency [*/•] 



Figure 13c. Processor efficiency as a function of 
nun. of . elea per processor r 
num . t ime . step=l 000 


In this case nut.of.elem per processor is equal to (N^/p) , what means 
that elements assigned to more than one transputer are counted only once. This 
insures that the efficiency is measured w.r.t. the sequential computation. 
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depending on the problem size and the number of time steps, 
using the maximum number of processors may not yield the 
shortest time. To determine the shortest solution time, the 
following inequality needs to be analyzed 

(T ) . > T + T (5-6) 

'■‘■tot'l proc prep ep 

T ) , - total time for sequential 

■‘tot' 1 proc ^ 

computation 

T prep - preparation time for parallel 

computation 


T cp - computation time for parallel 

computation 


For 

the bar 3 -dimensional problem. 

the following 

equations 

can be used to calculate 

num. proc. el em, 

num. nodes . 

needed and num. update. nodes 



h = N nodes/( ab ) 

(5-7a) 


N up = (sab) 

( 5-7b) 


= (s+1) (ab) 

(5-7c) 


N p.ei = (h- 1 )] 

( 5-7d) 

where 

s = [ ( (h-2 ) /p) +1] 

(5-7e) 


p - number of processors 

a, b, h - number of nodes along the sides of the 
parallelepiped 

After substitution into equation (5-6) , we can obtain the 
minimum number of processors which must be used to reduce the 
total time below the total time for one processor. Data for 
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There are two 


few chosen cases are presented in Figure 14 . 
factors which influence the value of the minimum number of 
processors. First, the preparation time increases for smaller 
numbers of processors. Second, the computation time decreases 
inversely with the number of processors. The minimum number 
of processors is the smallest value for which the decrease of 
T p compared to sequential computation is larger than Tp re p. Of 
course, this value is smaller for problems where T cp plays 
dominating role. This occurs in the case of problems with 
large number of time steps because T cp increases linearly with 
N step while T prep remains unchanged, or when the assemble of the 
stiffness matrix requires a relatively long time. Figure 14 
shows that even for a very small value of N step , the minimum 
number of processors is the smallest possible, two, which 
means that two or more processors will give a faster solution 
than a sequential computation. The reason for this is because 
for 8 node hexahedrals time for the computation of the 
stiffness matrix is relatively large, so that even if only a 
few elements are involved this time is greater than the 
communication time. As was stated above, the minimum 
num.of.proc is the smallest value for which decrease of T cp is 
larger than T pre p. Based on this, it can be explained why, for 
small N step , its value is very large. If num.of .proc=2 ; T prep , 
which does not depend on N step and decreases with num.of.proc, 
can greatly exceed T cp . When num.of.proc is increased, the 
gain in T cp in absolute value is relatively small, so large 
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minimum num.of,proc 



Figure 14. Minimum number of processors as a function 
of num . time . step 
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num.of.proc is required to reduce T prep under this small value. 

Equation (5-1) can be used to calculate the approximate 
execution time. However, the local data such as the number of 
update nodes, number of nodes needed, number of processor 
elements are not known before the program is executed since 
they are determined by the procedure performing the partition. 
Good accuracy can be achieved when the procedure calculating 
time is used along with the part of the program which performs 
the decomposition of the mesh and supplies the required data. 
In practice, however, the approximate execution time may be 
needed before input data is prepared. If we suppose that the 
structure has approximately parallelepiped shape, the local 
parameters mentioned above can be estimated. 

For a given problem with N el , number of elements, and 
N J , number of nodes, T tot can be estimated by one of the two 
methods presented below. First, as was shown previously, 
Figure 11, the longest total time for a constant N el occurs in 
the case of a cube. We can assume that the structure is a 
cube composed of number of elements equal to N el and then 
calculate components needed to compute T tot from equations (5- 
7) and (5-1) . In the second approach we can anticipate the 
shape of the parallelepiped using both N el and The sides 

of the parallelepiped can be obtained from the following 

formulas 

N n odes= abh 
N el - (a-1) (b-1) (h-1) 
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(5-8a) 
( 5-8b) 



or after eliminating h 

(b ! -b)a J + b[l-(N nodes -N el ) ]-b z )a + M no *,(b-1)-0 (5-9) 

Since we have one equation and two unknowns, we have to assume 
the ratio a/b. For (a/b)=l the above equation reduces to 

a‘ - 2a 3 + [ I” (N ftodM -N tl ) ] a 2 + 2N noa ,,a - N nodes = 0 (5-10) 

Solving the equation and taking the root 2 JN nodes z a > 0, 

the equations (5-7) and (5-1) can be used to estimate the 
total time. 


5.2 Turbine Blade 

A problem involving the analysis of a turbine blade was 
executed sequentially and on a two transputer network, with a 
mesh of 1575 nodes and 1025 elements (Figure 15) . It was 
assumed that the bottom nodes of the blade are fixed while 
initial displacements are applied to the top nodes. The 
results together with the estimated times are grouped in the 
Tables 5 and 6. 

In Table 5 the sequential and parallel solutions are 
compared. For num. time. step = 1, the sequential solution is 
faster than the concurrent one because the data manipulation 
time exceeds the gain in the computation time resulting from 
using two processors. It also should be noticed that with 
growing num. time, step the ratio (T tot ) 2 proc / (T tot ) , proc becomes 
smaller but never reaches 50%. This is so because more than 
half of the elements are assigned to each processor and thus 
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Table 6. ^Solution times for the turbine problem 
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times estimated by assuming the shape o< a parallelepiped 
















































































































the total work load is larger than in the sequential 
computation. Also, the time is increased by the preparation 
time which is absent in sequential computation. 

The measured times were compared with the calculated 
times in Table 6; the calculation was carried out using the 
two partitioning methods presented in the previous section. 
First, an equal number of nodes were sequentially assigned to 
each transputer, then the nodes were assigned in the manner 
such that each transputer handles an equal number of elements. 
As expected, in the second case the execution time is shorter; 
a reduction of about 10% was achieved. 

In both cases, the calculated times are smaller than the 
measured times. This is because the shape of the turbine 
blade does not exactly correspond to the assumed 
parallelepiped model, and more nodes have to be exchanged 
between processors than it is predicted by the cube model . 
The element partition was made along the blade platform (see 
Figure 15) . 


5 . 3 Two-dimensional Example 

The program was modified slightly in order to analyze 
two-dimensional problems more efficiently. The eight node 
solid element was replaced by a linear displacement triangular 
element which decreased the time of computation and assembly 
of the stiffness matrix. Also, where appropriate, three 
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dimensional vectors and matrices were replaced by two 
dimensional versions. The program was used to analyze several 
two-dimensional bar test problems (Table A-13) and the results 
were presented in Tables A— 14 through A— 19. 

In comparison with the three-dimensional case, the ratio 
of communication time to computation time (Ty/T^) for the 
two-dimensional problems is much larger. Table 7. Both times, 

T and T , are smaller for 2-D problems than for 3-D ones. 
However, the decrease of the computation time is much larger 
primarily because the calculation of the stiffness matrix is 
significantly faster than for the 3 -dimensional element. 

The 2-D problem was also examined by Patrick Smolinski 
on T800 transputer (see page 12) . Some results of his study 

are presented in Tables A-20 and A-21. 

Comparing these results with those for T414 transputer 
(Table A-13 through A-19) , we see that the total computation 
time is smaller for the T800 transputer. This is due to the 
higher performances of the T800 transputer (see page 12). 
However, we cannot calculate the ratio of the communication to 
the calculation time, since the communication time has not 
been measured for these problems. 


51 


Table 7. 


Comparison of the ratio of the communication time per 
node to the computation time per element 


CASE 

per. node 
<V per.elem 

3-D 

1.2 % 

2-D 

4.13 % 
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6.0 CONCLUSIONS 


The results of this study indicate that the execution 
time of a finite element program can be considerably reduced 
by parallel computation using a relatively inexpensive 
transputer system. However, a price is paid since a larger 
and more complex computer program is required. In addition to 
the part of the computer program performing the actual 
calculations, routines performing communication and 
decomposition of the mesh have to be written. 

For cases examined here, two and three-dimensional 
problems, the communication time, the time of exchanging 
displacements after each time step, was very small. However, 
if communication between more remote transputers is required 
this time will increase. In addition, when the time for 
assembling the stiffness matrix is relatively small and the 
number of exchanged nodes is large, the communication time has 
to be taken into account. For the problems solved in this 
study, the preparation time, the time of receiving and 
rearranging data, is more important. This time depends on the 
size of the structure and number of transputers used but not 
on the number of time steps. Consequently, for large number 
of time steps it becomes small in comparison to the 

computation time. 

The total execution time depends on the size of the mesh 
and local parameters such as number of nodes and elements 
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assigned to each transputer. The computation time depends 
mostly on number of elements and other local parameters that 
are related to this value. For problems that require a large 
number of time steps for the solution, the computation time is 
usually much larger than other components of the total time. 
Because this is usually the case when a transputer systems 
would be used, it means that partitioning of the mesh can be 
done considering only the number of elements. For this reason 
perhaps the most important factor in minimizing the solution 
time is to assign an equal work load to each transputer since 
the total time is primarily governed by the transputer with 
the largest work load. 

In most cases the parallel computation proves to be 
faster than the sequential one. The only cases when the 
sequential computation is faster occur for small number of 
time steps and are of no practical importance since most 
engineering problems require a significant number of time 
steps . 

In the future it would be desirable to write a program 
which checks the order of element numbering in the 
connectivity matrix and if necessary rearrange it. This would 
optimize the results of the program performing the 
decomposition of a mesh, see Figure 7. 

In this study, the program was executed on the only 
available transputer configuration, the pipeline (Figure la) . 
The correctness of the program was checked by using a special 
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feature of the OCCAM language which allows to simulate an 
arbitrary transputer network while a program is executed 
sequentially. It would be interesting to run this program on 
a large grid of T800 processors to study the efficiency of the 
algorithm and to determine the speed-up that could be obtained 
with these much faster processors* 


X 
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Solution times for the three-dimensional bar, 
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further description of the cases can be found in Table A-1. 


































































































































































the three-dimensional bar, 
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further discription of the case can be found in Table A-1. 






























































































































































Solution times for the three-dimensional bar 
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Further discretion of the cases can be found in Table A-1 








































































































































































Table A- 7 • 
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Further discretion of the cases can be found in Table A-2 
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Further description of the cases can be found in Table A-2. 
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further discretion of the cases can be found in Table A-2. 



























































































































































Solution times for the three-dimensional bar. 
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Further discretion of the cases con be found in Table A-3. 
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Further description of the cases can be found in Table A-3. 





























































































































































Solution tiroes for the three-dimensional bar, 
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Further description of the cases can be found in Table A*3. 
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Further description of the problem can be found in Table A-13, 
CAS£=0 4 nure.of .proc=2 
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further discription of the problem can be found in Table A-13, 
CASE=1 , nun.of ,proc=2 













































Table A-16. - Solution times for the two-dimensional problem 
(nnodex) x (nnodey) = 40x10, 
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Further discHption of the problem can be found in Table A - 1 3 
CA$E=2 , nun. of ,proc=2 
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Further discretion of the problem can be found in Table A-13, 
CASE=0 f num.of .proc^A 

















































74 


Further description of the problem can be found in Table A-13 
CASE = 1 , non. of ,proc=4 
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Further discretion 
CASE =2 , nun. of ,proc=<i 
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